290 research outputs found

RepSeq - A database of amino acid repeats present in lower eukaryotic pathogens

Author: AL Hughes
AL Price
Daniel P Depledge
Deborah F Smith
DP Depledge
E Pizzi
EM LeProust
EM Marcotte
G Benson
GP Singh
JA Subirana
JL Clarke
KK Tetteh
MJ Gardner
MK Kalita
MM Alba
MV Katti
PJ Rosenthal
Ryan PJ Lower
S Kruglyak
T Ilg
WW Zhang
Y Kashi
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2007

BACKGROUND: Amino acid repeat-containing proteins have a broad range of functions and their identification is of relevance to many experimental biologists. In human-infective protozoan parasites (such as the Kinetoplastid and Plasmodium species), they are implicated in immune evasion and have been shown to influence virulence and pathogenicity. RepSeq (http://repseq.gugbe.com) is a new database of amino acid repeat-containing proteins found in lower eukaryotic pathogens. The RepSeq database is accessed via a web-based application which also provides links to related online tools and databases for further analyses. RESULTS: The RepSeq algorithm typically identifies more than 98% of repeat-containing proteins and is capable of identifying both perfect and mismatch repeats. The proportion of proteins that contain repeat elements varies greatly between different families and even species (3 - 35% of the total protein content). The most common motif type is the Sequence Repeat Region (SRR), a repeated motif containing multiple different amino acid types. Proteins containing Single Amino Acid Repeats (SAARs) and Di-Peptide Repeats (DPRs) typically account for 0.5 - 1.0% of the total protein number. Notable exceptions are P. falciparum and D. discoideum, in which 33.67% and 34.28% respectively of the predicted proteomes consist of repeat-containing proteins. These numbers are due to large insertions of low complexity single and multi-codon repeat regions. CONCLUSION: The RepSeq database provides a repository for repeat-containing proteins found in parasitic protozoa. The database allows for both individual and cross-species proteome analyses and also allows users to upload sequences of interest for analysis by the RepSeq algorithm. Identification of repeat-containing proteins provides researchers with a defined subset of proteins which can be analysed by expression profiling and functional characterisation, thereby facilitating study of pathogenicity and virulence factors in the parasitic protozoa. While primarily designed for kinetoplastid work, the RepSeq algorithm and database retain full functionality when used to analyse other species.
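
The two simplest repeat classes named in the abstract, SAARs and DPRs, can be illustrated with a short sketch. This is not the RepSeq algorithm: the function names and the minimum-length thresholds below are hypothetical, only perfect repeats are matched, and mismatch repeats and SRRs are not handled.

```python
import re

# Hypothetical thresholds; the abstract does not state RepSeq's cut-offs.
MIN_SAAR_LENGTH = 5  # minimum run length of a single residue
MIN_DPR_UNITS = 3    # minimum number of consecutive di-peptide units


def find_saars(sequence, min_length=MIN_SAAR_LENGTH):
    """Return (start, residue, length) for each perfect single amino acid repeat."""
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_length - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(sequence)]


def find_dprs(sequence, min_units=MIN_DPR_UNITS):
    """Return (start, unit, unit_count) for each perfect di-peptide repeat.

    The negative lookahead rejects units such as "QQ", which are SAARs.
    """
    pattern = re.compile(r"(([A-Z])(?!\2)[A-Z])\1{%d,}" % (min_units - 1))
    return [(m.start(), m.group(1), len(m.group(0)) // 2)
            for m in pattern.finditer(sequence)]


# Toy sequence, not taken from any proteome.
toy = "MKQQQQQQLANANANANT"
saars = find_saars(toy)
dprs = find_dprs(toy)
```

On the toy sequence, `saars` holds one run of six glutamines and `dprs` holds one run of four "AN" units. Tolerating mismatches would need an alignment-based or scoring approach rather than a plain regular expression.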

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

White Rose Research Online

Towards validating the hypothesis of phylogenetic profiling

Author: D Lin
EM Marcotte
J Handl
J Jäkel
J Seo
J Sun
J Wu
M Pellegrini
Mazen Atwi
N Bolshakova
P Resnik
R Loganantharaj
Raja Loganantharaj
RL Tatusov
SF Altschul
SV Date
Publication venue: BioMed Central
Publication date: 01/11/2007

Crossref

Springer - Publisher Connector

PubMed Central

CODA: Accurate Detection of Functional Associations between Proteins in Eukaryotic Genomes Using Domain Fusion

Author: Adam J. Reid
AJ Enright
Andrew B. Clegg
B Snel
C von Mering
C Yeats
Christine A. Orengo
CJ Marcotte
DE Barnes
EM Marcotte
F Bellivier
G Apic
I Yanai
Juan A. G. Ranea
K Truong
M Huynen
Magnus Rattray
P Resnik
PM Bowers
PW Lord
RD Finn
S Hoffman
SF Altschul
SK Kummerfeld
TF Smith
Publication venue: Public Library of Science
Publication date: 01/01/2009

Background: In order to understand how biological systems function it is necessary to determine the interactions and associations between proteins. Gene fusion prediction is one approach to detection of such functional relationships. Its use is however known to be problematic in higher eukaryotic genomes due to the presence of large homologous domain families. Here we introduce CODA (Co-Occurrence of Domains Analysis), a method to predict functional associations based on the gene fusion idiom. Methodology/Principal Findings: We apply a novel scoring scheme which takes account of the genome-specific size of homologous domain families involved in fusion to improve accuracy in predicting functional associations. We show that CODA is able to accurately predict functional similarities in human with comparison to state-of-the-art methods and show that different methods can be complementary. CODA is used to produce evidence that a currently uncharacterised human protein may be involved in pathways related to depression and that another is involved in DNA replication. Conclusions/Significance: The relative performance of different gene fusion methodologies has not previously been explored. We find that they are largely complementary, with different methods being more or less appropriate in different genomes. Our method is the only one currently available for download and can be run on an arbitrary dataset by the user. The CODA software and datasets are freely available from ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/v6.1.0/CODA/. Predictions are also available via web services from http://funcnet.eu/

CiteSeerX

Public Library of Science (PLOS)

Crossref

PubMed Central

UCL Discovery

Genome wide prediction of protein function via a generic knowledge discovery approach based on evidence integration

Author: A Drawid
A Lagreid
A Tanay
AC Gavin
AJ Enright
B Schwikowski
CJ Roberts
EM Marcotte
GD Bader
HJ Bussemaker
HW Mewes
I Cherel
J Ihmels
Jianghui Xiong
Kunyi Luo
LF Wu
M Ashburner
M Deng
M Pellegrini
MB Eisen
MC von
MP Brown
OG Troyanskaya
P Jorgensen
P Uetz
PT Spellman
R Kohavi
R Overbeek
SF Altschul
Shanguang Chen
Simon Rayner
T Ito
TR Hazbun
TR Hughes
U Karaoz
WK Huh
WR Pearson
X Zhou
Y Chen
Y Ho
Yinghui Li
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: The automation of many common molecular biology techniques has resulted in the accumulation of vast quantities of experimental data. One of the major challenges now facing researchers is how to process this data to yield useful information about a biological system (e.g. knowledge of genes and their products, and the biological roles of proteins, their molecular functions, localizations and interaction networks). We present a technique called Global Mapping of Unknown Proteins (GMUP) which uses the Gene Ontology Index to relate diverse sources of experimental data by creation of an abstraction layer of evidence data. This abstraction layer is used as input to a neural network which, once trained, can be used to predict function from the evidence data of unannotated proteins. The method allows us to include almost any experimental data set related to protein function, which incorporates the Gene Ontology, to our evidence data in order to seek relationships between the different sets. RESULTS: We have demonstrated the capabilities of this method in two ways. We first collected various experimental datasets associated with yeast (Saccharomyces cerevisiae) and applied the technique to a set of previously annotated open reading frames (ORFs). These ORFs were divided into training and test sets and were used to examine the accuracy of the predictions made by our method. Then we applied GMUP to previously un-annotated ORFs and made 1980, 836 and 1969 predictions corresponding to the GO Biological Process, Molecular Function and Cellular Component sub-categories respectively. We found that GMUP was particularly successful at predicting ORFs with functions associated with the ribonucleoprotein complex, protein metabolism and transportation. CONCLUSION: This study presents a global and generic gene knowledge discovery approach based on evidence integration of various genome-scale data. It can be used to provide insight as to how certain biological processes are implemented by interaction and coordination of proteins, which may serve as a guide for future analysis. New data can be readily incorporated as it becomes available to provide more reliable predictions or further insights into processes and interactions.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Gene Function Classification Using Bayesian Models with Hierarchy-Based Priors

Author: A Clare
A McCallum
AS Weigend
B Rost
B Schoikowski
B Shahbaba
Babak Shahbaba
BE Engelhardt
D Koller
EM Marcotte
FR Blattner
H Blockeel
I Tsochantaridis
IUBMB
J DeRisi
J Fox
J Goodman
J Struyf
J Zhang
JA Eisen
JR Guest
K Sjölander
L Cai
L Dehaspe
M Brown
M Deng
M Eisen
M Riley
N Cesa-Bianchi
O Dekel
P Pavlidis
R Caruana
R Eisner
Radford M Neal
RD King
RM Neal
S Rison
S Sattath
S Spiro
SF Altschul
ST Dumais
WR Pearson
Z Barutcuoglu
Publication date: 01/01/2006

We investigate the application of hierarchical classification schemes to the annotation of gene function based on several characteristics of protein sequences including phylogenic descriptors, sequence based attributes, and predicted secondary structure. We discuss three Bayesian models and compare their performance in terms of predictive accuracy. These models are the ordinary multinomial logit (MNL) model, a hierarchical model based on a set of nested MNL models, and an MNL model with a prior that introduces correlations between the parameters for classes that are nearby in the hierarchy. We also provide a new scheme for combining different sources of information. We use these models to predict the functional class of Open Reading Frames (ORFs) from the E. coli genome. The results from all three models show substantial improvement over previous methods, which were based on the C5 algorithm. The MNL model using a prior based on the hierarchy outperforms both the non-hierarchical MNL model and the nested MNL model. In contrast to previous attempts at combining these sources of information, our approach results in a higher accuracy rate when compared to models that use each data source alone. Together, these results show that gene function can be predicted with higher accuracy than previously achieved, using Bayesian models that incorporate suitable prior information.
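
As a rough illustration of the first two model families named in the abstract, the sketch below computes class probabilities for a flat multinomial logit model and for a two-level nested version in which a leaf class probability is the product of the probabilities along its path. The function names, the two-level layout and the toy parameters are assumptions made for illustration; the paper's priors, features and fitting procedure are not reproduced here.

```python
import math


def mnl_probabilities(x, intercepts, coefficients):
    """Flat MNL: P(class j | x) is proportional to exp(a_j + b_j . x)."""
    scores = [a + sum(b_i * x_i for b_i, x_i in zip(b, x))
              for a, b in zip(intercepts, coefficients)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def nested_mnl_probabilities(x, parent_model, child_models):
    """Two-level nested MNL (hypothetical layout).

    parent_model and each entry of child_models is an
    (intercepts, coefficients) pair. Returns one list of leaf
    probabilities per parent class, each scaled by P(parent | x).
    """
    parent_probs = mnl_probabilities(x, *parent_model)
    return [[p_parent * p_child
             for p_child in mnl_probabilities(x, *child)]
            for p_parent, child in zip(parent_probs, child_models)]


# Toy parameters chosen only to exercise the functions.
x = [1.0, 2.0]
flat = mnl_probabilities(x, [0.0, 0.0, 0.0], [[0.0, 0.0]] * 3)
parent = ([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
children = [([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]]),
            ([0.0, 0.0, 0.0], [[0.0, 0.0]] * 3)]
nested = nested_mnl_probabilities(x, parent, children)
```

With all parameters set to zero, every class in a group is equally likely, which makes the outputs easy to check by hand. The third model in the abstract keeps the flat likelihood and instead ties the parameters of nearby classes together through the prior, which this sketch does not attempt.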

arXiv.org e-Print Archive

University of Toronto Research Repository

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Identification and Analysis of Co-Occurrence Networks with NetCutter

Author: A Brazma
A Kel
B Snel
BJ Stapley
BP Berman
C Perez-Iratxeta
D Chaussabel
D Rebholz-Schuhmann
DL Wheeler
DR Masys
EM Marcotte
Francesco Mancuso
G Dennis Jr
G Finocchiaro
GW Flake
Heiko Müller
J Ding
JD Wren
Ji Zhu
L Tanabe
LA Goodman
M Bansal
M Girvan
M Markstein
M Pellegrini
MA Huynen
ME Newman
MJ Schuemie
MS Halfon
NR Smalheiser
P Sudarsanam
R Elkon
RE Tarjan
RL Tatusov
S Tavazoie
S Zhu
SA Jelinsky
SX Chen
T Manke
TC Rindflesch
TK Jenssen
WW Wasserman
Y Pilpel
Publication venue: Public Library of Science
Publication date: 10/09/2008

BACKGROUND: Co-occurrence analysis is a technique often applied in text mining, comparative genomics, and promoter analysis. The methodologies and statistical models used to evaluate the significance of association between co-occurring entities are quite diverse, however. METHODOLOGY/PRINCIPAL FINDINGS: We present a general framework for co-occurrence analysis based on a bipartite graph representation of the data, a novel co-occurrence statistic, and software performing co-occurrence analysis as well as generation and analysis of co-occurrence networks. We show that the overall stringency of co-occurrence analysis depends critically on the choice of the null-model used to evaluate the significance of co-occurrence and find that random sampling from a complete permutation set of the bipartite graph permits co-occurrence analysis with optimal stringency. We show that the Poisson-binomial distribution is the most natural co-occurrence probability distribution when vertex degrees of the bipartite graph are variable, which is usually the case. Calculation of Poisson-binomial P-values is difficult, however. Therefore, we propose a fast bi-binomial approximation for calculation of P-values and show that this statistic is superior to other measures of association such as the Jaccard coefficient and the uncertainty coefficient. Furthermore, co-occurrence analysis of more than two entities can be performed using the same statistical model, which leads to increased signal-to-noise ratios, robustness towards noise, and the identification of implicit relationships between co-occurring entities. Using NetCutter, we identify a novel protein biosynthesis related set of genes that are frequently coordinately deregulated in human cancer related gene expression studies. NetCutter is available at http://bio.ifom-ieo-campus.it/NetCutter/. CONCLUSION: Our approach can be applied to any set of categorical data where co-occurrence analysis might reveal functional relationships such as clinical parameters associated with cancer subtypes or SNPs associated with disease phenotypes. The stringency of our approach is expected to offer an advantage in a variety of applications.
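
To make the statistics mentioned in the abstract concrete, the sketch below computes an exact Poisson-binomial upper-tail P-value by dynamic programming, alongside the Jaccard coefficient for comparison. It is a generic illustration, not NetCutter's code: the function names are hypothetical, the per-trial co-occurrence probabilities are taken as given (the abstract does not say how NetCutter derives them), and the bi-binomial approximation proposed in the paper is not reproduced.

```python
def poisson_binomial_pmf(probabilities):
    """Exact PMF of the number of successes in independent trials
    with unequal success probabilities (quadratic-time recurrence)."""
    pmf = [1.0]  # zero trials: zero successes with probability 1
    for p in probabilities:
        updated = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            updated[k] += mass * (1.0 - p)   # this trial fails
            updated[k + 1] += mass * p       # this trial succeeds
        pmf = updated
    return pmf


def upper_tail_p_value(probabilities, observed):
    """P(X >= observed) under the Poisson-binomial null model."""
    return sum(poisson_binomial_pmf(probabilities)[observed:])


def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy inputs for illustration only.
p_equal = upper_tail_p_value([0.5, 0.5], 1)
p_mixed = upper_tail_p_value([0.1, 0.2, 0.9], 3)
similarity = jaccard({1, 2, 3}, {2, 3, 4})
```

When all trial probabilities are equal the result reduces to the ordinary binomial tail. The exact recurrence costs time quadratic in the number of trials, which suggests why a faster approximation is attractive for large bipartite graphs.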

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Cinteny: flexible analysis and visualization of synteny and genome rearrangements in multiple organisms

Author: Amit U Sinha
BME Moret
D Sankoff
DA Bader
DL Wheeler
EM Marcotte
G Andelfinger
G Bourque
G Tesler
Jaroslaw Meller
JH Nadeau
JL Bentley
KA Frazer
LD Stein
M Clamp
PA Pevzner
Q Peng
S Hannenhalli
T Hubbard
TF Deluca
X Pan
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Identifying syntenic regions, i.e., blocks of genes or other markers with evolutionarily conserved order, and quantifying evolutionary relatedness between genomes in terms of chromosomal rearrangements is one of the central goals in comparative genomics. However, the analysis of synteny and the resulting assessment of genome rearrangements are sensitive to the choice of a number of arbitrary parameters that affect the detection of synteny blocks. In particular, the choice of a set of markers and the effect of different aggregation strategies, which enable coarse graining of synteny blocks and exclusion of micro-rearrangements, need to be assessed. Therefore, existing tools and resources that facilitate identification, visualization and analysis of synteny need to be further improved to provide a flexible platform for such analysis, especially in the context of multiple genomes. RESULTS: We present a new tool, Cinteny, for fast identification and analysis of synteny with different sets of markers and various levels of coarse graining of syntenic blocks. Using the Hannenhalli-Pevzner approach and its extensions, Cinteny also enables interactive determination of evolutionary relationships between genomes in terms of the number of rearrangements (the reversal distance). In particular, Cinteny provides: i) integration of synteny browsing with assessment of evolutionary distances for multiple genomes; ii) flexibility to adjust the parameters and re-compute the results on-the-fly; iii) ability to work with user provided data, such as orthologous genes, sequence tags or other conserved markers. In addition, Cinteny provides many annotated mammalian, invertebrate and fungal genomes that are pre-loaded and available for analysis at . CONCLUSION: Cinteny allows one to automatically compare multiple genomes and perform sensitivity analysis for synteny block detection and for the subsequent computation of reversal distances. Cinteny can also be used to interactively browse syntenic blocks conserved in multiple genomes, to facilitate genome annotation and validation of assemblies for newly sequenced genomes, and to construct and assess phylogenomic trees.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

An Improved, Bias-Reduced Probabilistic Functional Gene Network of Baker's Yeast, Saccharomyces cerevisiae

Background: Probabilistic functional gene networks are powerful theoretical frameworks for integrating heterogeneous functional genomics and proteomics data into objective models of cellular systems. Such networks provide syntheses of millions of discrete experimental observations, spanning DNA microarray experiments, physical protein interactions, genetic interactions, and comparative genomics; the resulting networks can then be easily applied to generate testable hypotheses regarding specific gene functions and associations. Methodology/Principal Findings: We report a significantly improved version (v. 2) of a probabilistic functional gene network [1] of the baker's yeast, Saccharomyces cerevisiae. We describe our optimization methods and illustrate their effects in three major areas: the reduction of functional bias in network training reference sets, the application of a probabilistic model for calculating confidences in pair-wise protein physical or genetic interactions, and the introduction of simple thresholds that eliminate many false positive mRNA co-expression relationships. Using the network, we predict and experimentally verify the function of the yeast RNA binding protein Puf6 in 60S ribosomal subunit biogenesis. Conclusions/Significance: YeastNet v. 2, constructed using these optimizations together with additional data, shows significant reduction in bias and improvements in precision and recall, in total covering 102,803 linkages among 5,483 yeast proteins (95% of the validated proteome). YeastNet is available from http://www.yeastnet.org. This work was supported by grants from the N.S.F. (IIS-0325116, EIA-0219061), N.I.H. (GM06779-01, GM076536-01), Welch (F-1515), and a Packard Fellowship (EMM). These agencies were not involved in the design and conduct of the study, in the collection, analysis, and interpretation of the data, or in the preparation, review, or approval of the manuscript.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Texas ScholarWorks

Category Theoretic Analysis of Hierarchical Protein Materials and Social Networks

Author: A Fritsch
AL Barabasi
B Alberts
BC Pierce
CM Schneider
D Eisenberg
D Taylor
DA Fletcher
David I. Spivak
DB Searls
DI Spivak
E Moggi
E Rodriguez
Elizabeth Wood
EM Marcotte
FW Lawvere
GB Olson
H Jeong
H Peterlik
I Lee
J Aizenberg
J Verdasca
JD Currey
K Hofstetter
Laurent Kreplak
M Barr
M Moortgat
Markus J. Buehler
MD Hauser
MJ Buehler
MS Szalay
N Chomsky
N Huebsch
NM Pugno
O Mason
P Csermely
P Fratzl
P Nurse
P Wadler
R Brown
R Lakes
R Milo
R Paparcone
R Pastor-Satorras
RC Strohman
RT Oehrle
S Awodey
S Eilenberg
S Keten
SM Lane
SW Cranford
T Ackbarow
Tristan Giesa
WW Powell
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2011

Materials in biology span all the scales from Angstroms to meters and typically consist of complex hierarchical assemblies of simple building blocks. Here we describe an application of category theory to describe structural and resulting functional properties of biological protein materials by developing so-called ologs. An olog is like a “concept web” or “semantic network” except that it follows a rigorous mathematical formulation based on category theory. This key difference ensures that an olog is unambiguous, highly adaptable to evolution and change, and suitable for sharing concepts with other ologs. We consider simple cases of beta-helical and amyloid-like protein filaments subjected to axial extension and develop an olog representation of their structural and resulting mechanical properties. We also construct a representation of a social network in which people send text-messages to their nearest neighbors and act as a team to perform a task. We show that the olog for the protein and the olog for the social network feature identical category-theoretic representations, and we proceed to precisely explicate the analogy or isomorphism between them. The examples presented here demonstrate that the intrinsic nature of a complex system, which in particular includes a precise relationship between structure and function at different hierarchical levels, can be effectively represented by an olog. This, in turn, allows for comparative studies between disparate materials or fields of application, and results in novel approaches to derive functionality in the design of de novo hierarchical systems. We discuss opportunities and challenges associated with the description of complex biological materials by using ologs as a powerful tool for analysis and design in the context of materiomics, and we present the potential impact of this approach for engineering, life sciences, and medicine.
Presidential Early Career Award for Scientists and Engineers (N000141010562); United States. Army Research Office. Multidisciplinary University Research Initiative (W911NF0910541); United States. Office of Naval Research (grant N000141010841); Massachusetts Institute of Technology. Dept. of Mathematics; Studienstiftung des deutschen Volkes; Clark Barwick; Jacob Luri

arXiv.org e-Print Archive

Public Library of Science (PLOS)

CiteSeerX

DSpace@MIT

Crossref

Directory of Open Access Journals

PubMed Central

Publikationsserver der RWTH Aachen University

False positive reduction in protein-protein interaction predictions using gene ontology annotations

Author: A Bairoch
A Valencia
C Alfarano
C von Mering
CM Deane
DR Rhodes
E Camon
E Sprinzak
EM Marcotte
FM Couto
G Butland
GR Smith
H Wu
H Zhu
I Albert
IMA Nooren
J Janin
J Wu
J Yu
JBL Bard
JR Bock
K Tu
KV Brinda
L Lu
M Deng
M Hayashida
M Pellegrini
M Strong
Mahmoud A Mahdavi
O Carugo
P Bork
PW Lord
S Li
SL Lo
SV Date
T Dandekar
T Fujimori
T Yamada
The Gene Ontology Consortium
TR Hazbun
U Güldener
V van Noort
X Wu
XJ Zhou
Y Huang
Y Liu
Yen-Han Lin
Publication venue: BioMed Central
Publication date: 01/07/2007

Background: Many crucial cellular operations such as metabolism, signalling, and regulation are based on protein-protein interactions. However, the lack of robust protein-protein interaction information is a challenge. One reason for the lack of solid protein-protein interaction information is poor agreement between experimental findings and computational sets that, in turn, comes from huge false positive predictions in computational approaches. Reduction of false positive predictions and enhancing true positive fraction of computationally predicted protein-protein interaction datasets based on highly confident experimental results has not been adequately investigated. Results: Gene Ontology (GO) annotations were used to reduce false positive protein-protein interaction (PPI) pairs resulting from computational predictions. Using experimentally obtained PPI pairs as a training dataset, eight top-ranking keywords were extracted from GO molecular function annotations. The sensitivity of these keywords is 64.21% in the yeast experimental dataset and 80.83% in the worm experimental dataset. The specificities, a measure of recovery power, of these keywords applied to four predicted PPI datasets for each studied organism are 48.32% and 46.49% (averaged over the four datasets) in yeast and worm, respectively. Based on the eight top-ranking keywords and co-localization of interacting proteins, a set of two knowledge rules was deduced and applied to remove false positive protein pairs. The 'strength', a measure of improvement provided by the rules, was defined based on the signal-to-noise ratio and implemented to measure the applicability of the knowledge rules to the predicted PPI datasets. Depending on the employed PPI-predicting methods, the strength varies between two and ten-fold of randomly removing protein pairs from the datasets. Conclusion: Gene Ontology annotations along with the deduced knowledge rules could be implemented to partially remove false predicted PPI pairs. Removal of false positives from predicted datasets increases the true positive fractions of the datasets and improves the robustness of predicted pairs as compared to random protein pairing, and eventually results in better overlap with experimental results.

Crossref

Directory of Open Access Journals

PubMed Central